Plotting in Python

Lecture 3

Introduction

  • I have now met a second former student who lives by a “rule of two,” so it is time to share it.
  • Both are quite successful. Their rule:
    • Master two skills: one focused on data and the other focused on analysis / statistics.
  • The first person says: “pandas and numpy.”
  • The second person, who will present on Data Viz in business next month, says: “dplyr and ggplot.”

Python

  • Has > 135K libraries
  • We will be using Pandas, Numpy, Matplotlib
  • also Seaborn
  • This lecture is not meant to be comprehensive; the goal is to help you narrow it down to “two”.

The stats on Pandas, Numpy, Matplotlib & Seaborn

  • Pandas: maybe fewer than 100 core developers and thousands of contributors/followers

  • Numpy: maybe fewer than 50 core developers and a bit more than 1000 contributors/followers

  • Matplotlib: maybe fewer than 40 core developers and fewer than 1000 contributors/followers

  • Seaborn: essentially one core developer and maybe 50 contributors

The Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *
from plotnine.data import mtcars

Pandas origins and logic

  • The name “Pandas” comes from “Panel Data,” a social science term for data that include observations over multiple time periods for the same individuals—or panels.

  • The core data structure in Pandas is the DataFrame, which is designed to make structured data easy to handle and analyze.

  • A DataFrame is just a table of data organized into rows and columns, like an R data frame, a spreadsheet, or a SQL table.
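
A minimal sketch of the "table of rows and columns" idea. The item names and prices below are made up for illustration and are not taken from the lecture datasets.

```python
import pandas as pd

# Hypothetical toy data, not from the lecture datasets
toy = pd.DataFrame(
    {
        "name": ["chair", "table", "lamp"],
        "price": [25.0, 120.0, 15.5],
    }
)

print(toy.shape)             # (number of rows, number of columns)
print(toy.dtypes)            # each column carries its own data type
print(toy["price"].mean())   # a column can be summarized on its own
```

Each dictionary key becomes a column and each list position becomes a row, which is the same layout you would see in a spreadsheet.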

Basic Example

  • As in R, we can read CSV files (and many other formats)

  • pd refers to pandas, and read_csv() is the pandas function used to read CSV files

  • The dataset can reside on the web

  • It is wise (not required) to specify an index column when we have (1) time series data or (2) unique IDs.

  • This makes grouping, lookups, and joins faster and easier
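
To see why a date index helps, here is a small sketch on made-up prices (the values and the variable names are assumptions, not the AAPL data). Setting the index by hand here has the same effect as passing index_col='Date', parse_dates=True to read_csv().

```python
import pandas as pd

# Hypothetical prices for illustration only
prices = pd.DataFrame(
    {
        "Date": ["2019-01-02", "2019-01-03", "2019-02-01"],
        "Open": [10.0, 11.0, 12.0],
    }
)

# Parse the dates and use them as the row index
prices["Date"] = pd.to_datetime(prices["Date"])
prices = prices.set_index("Date")

# Look up a single day by its label
one_day = prices.loc[pd.Timestamp("2019-01-03"), "Open"]

# Partial-string indexing: every row in January 2019
january = prices.loc["2019-01"]

# Group by calendar month using the index directly
monthly_mean = prices.groupby(prices.index.month)["Open"].mean()

print(one_day)
print(january)
print(monthly_mean)
```

Without the index, each of these steps would need an explicit filter on the Date column.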

Basic Example Code for AAPL

AAPL = pd.read_csv('https://raw.githubusercontent.com/lewv/S24STATS101A/main/data/AAPL.csv', index_col='Date', parse_dates=True) # function read_csv() from pandas

AAPL.head(3) #.head() is a method

Basic Example Result

                 Open       High        Low      Close  Adj Close     Volume
Date
2019-01-02  38.722500  39.712502  38.557499  39.480000  37.845039  148158800
2019-01-03  35.994999  36.430000  35.500000  35.547501  34.075397  365248800
2019-01-04  36.132500  37.137501  35.950001  37.064999  35.530060  234428400

Basic Example Code for Ikea

ikea = pd.read_csv('https://raw.githubusercontent.com/lewv/S24STATS101A/main/data/ikea.csv',
                   index_col='item_id')

ikea.head(3)

Ikea data

Unnamed: 0 name category price old_price sellable_online link other_colors short_description designer depth height width
item_id
90420332 0 FREKVENS Bar furniture 265.0 No old price True https://www.ikea.com/sa/en/p/frekvens-bar-tabl... No Bar table, in/outdoor, 51x51 cm Nicholai Wiig Hansen NaN 99.0 51.0
368814 1 NORDVIKEN Bar furniture 995.0 No old price False https://www.ikea.com/sa/en/p/nordviken-bar-tab... No Bar table, 140x80 cm Francis Cayouette NaN 105.0 80.0
9333523 2 NORDVIKEN / NORDVIKEN Bar furniture 2095.0 No old price False https://www.ikea.com/sa/en/p/nordviken-nordvik... No Bar table and 4 bar stools Francis Cayouette NaN NaN NaN

Aside: classes, objects, functions, methods

  • Python is largely object-oriented (OOP), while R leans toward a functional programming style.

  • A Pandas DataFrame in the abstract is a “class”

  • The Python objects AAPL and ikea are instances of the Pandas DataFrame class.

  • Methods are defined within a class’s definition and are called on specific objects, e.g. AAPL.head(3).

  • Functions can be defined independently of classes and are not necessarily associated with any objects.
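
A short sketch of the distinction, using a made-up DataFrame (the data and variable names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"price": [3.0, 1.0, 2.0]})

# df is an object: an instance of the DataFrame class
is_dataframe = isinstance(df, pd.DataFrame)

# A method is called on the object with dot notation
first_two = df.head(2)

# A function is called on its own and takes the object as an argument
n_rows = len(df)                 # built-in Python function
stacked = pd.concat([df, df])    # function provided by the pandas module

print(is_dataframe, n_rows, len(stacked))
```

Note that read_csv() in the earlier examples is a function (pd.read_csv(...)), while .head() is a method of the DataFrame it returns.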

Basic Plot Using Matplotlib pyplot

CODE: Basic Plot with Matplotlib pyplot

If you are using Jupyter and generating simple graphics, you don’t need plt.show();, but here (using Quarto chunks) I do.

plt.plot(AAPL.index, AAPL.Open)
plt.title(r'AAPL since 1/2019', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Open (USD)')

plt.show();

Basic Scatterplot modify color and marker

CODE: Basic Scatterplot modify color and marker

color = np.log(ikea.price)

plt.scatter(x=ikea.height, y=ikea.width,
            c=color, marker='.', alpha=0.5)

plt.title(r'Ikea Product Height & Width', fontsize=20)
plt.xlabel('Height (cm)')
plt.ylabel('Width (cm)')

plt.show();

Basic Plot make it a line

CODE: Basic Plot make it a line

plt.plot(AAPL.index, 
          AAPL.Open, 
          color='purple', 
          linestyle='-', 
          linewidth=0.5)

plt.title(r'AAPL since 1/2019', fontsize=20)
plt.xlabel('Date')
plt.ylabel('Open (USD)')

plt.show();

Or many lines

CODE: Or many lines

plt.plot(AAPL.index, AAPL.Open, color='purple', linestyle='-', linewidth=0.25, label='Open')

plt.plot(AAPL.index, AAPL.Close, color='blue', linestyle='-', linewidth=0.25, label='Close')

plt.plot(AAPL.index, AAPL.High, color='green', linestyle='-', linewidth=0.25, label='High')

plt.plot(AAPL.index, AAPL.Low, color='red', linestyle='-', linewidth=0.25, label='Low')

plt.legend()

plt.title(r'AAPL Open/Close/High/Low', fontsize=20)
plt.xlabel('Date')
plt.ylabel('US Dollars')

plt.show();

Histogram

Code: Histogram

n_bins = 30

plt.hist(AAPL.Close, bins = n_bins)

plt.title(r'AAPL Close (USD) histogram', fontsize=20)
plt.xlabel('Close (USD)')
plt.ylabel('Frequency')
plt.show();

Histogram - density

Code: Histogram - density

n_bins = 30

plt.hist(AAPL.Close, density = True, bins = n_bins)

plt.title(r'AAPL Close (USD) density', fontsize=20)
plt.xlabel('Close (USD)')
plt.ylabel('Density')

plt.show();
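
What density = True changes can be checked numerically. The sketch below uses NumPy's np.histogram, which applies the same normalization, on synthetic values standing in for AAPL.Close (the random data and variable names are assumptions).

```python
import numpy as np

# Synthetic stand-in for a price column
rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=15, size=1_000)

# Raw counts per bin, as in the frequency histogram
counts, edges = np.histogram(values, bins=30)

# Density per bin, as in the density histogram
density, _ = np.histogram(values, bins=30, density=True)

widths = np.diff(edges)

# The bars of a density histogram have a total area of 1
total_area = np.sum(density * widths)
print(total_area)

# Each density value is count / (total count * bin width)
print(np.allclose(density, counts / (counts.sum() * widths)))
```

So the shape of the histogram is unchanged; only the vertical scale differs, which makes histograms of different sample sizes comparable.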

Matplotlib, Pandas, Seaborn

  • You will always need Matplotlib

  • But Matplotlib is not as well suited for statistical graphics (not the “core” mission)

  • In contrast, Seaborn is all about statistical graphics

  • Pandas is all about data; it has some statistical graphics capability, and it is much more widely supported than Seaborn

Some Comparisons - Pandas

  • Integrates graphics with its DataFrames well, so it is quicker to learn and remember.
  • Does not allow a lot of customization; faceting, for example, would be easier with Seaborn
  • Its basic plots are not as visually appealing

Some Comparisons - Seaborn

  • Designed for statistical graphics

  • Basic plots are more appealing than Pandas

  • Easier to facet

  • Has more complicated options than Pandas

Some Comparisons - Matplotlib

  • Requires a lot of effort to make a simple statistical plot

  • But offers maximum customization

  • The foundation for much of the graphics in Python, including Pandas plotting and Seaborn

  • Basic visual appeal is not as nice as Seaborn’s, but it can be customized to be nicer (with much more work)

Basic Boxplot in pandas

Code: Basic Boxplot in pandas

# Create a boxplot
ikea.boxplot(column="price", by="category", figsize=(14, 6))

# Set titles and labels using Matplotlib
plt.xlabel('Category')
plt.xticks(rotation = 45, fontsize = 7, ha = 'right')
plt.ylabel('Price')
plt.title('Comparison of Price Across Category')

plt.show();

Basic Boxplot in seaborn

Code: Basic Boxplot in seaborn

# adjust figure size
plt.figure(figsize=(14, 6))

# Create a boxplot
sns.boxplot(x='category', y='price', hue='category', data = ikea)

# Set titles and labels using Matplotlib
plt.xlabel('Category')
plt.xticks(rotation = 45, fontsize = 7, ha = 'right')
plt.ylabel('Price')
plt.title('Comparison of Price Across Category')

plt.show();

Faceting

  • Helps Exploration

  • Comparisons made easier

  • Clarity (can reduce overplotting)

  • Not specific to one industry/field

Faceting Boxplots in seaborn

Faceting Boxplots with Seaborn

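# First four categories in order of appearance (not ranked by size or price)
top_categories = ikea['category'].unique()[:4]

new_data = ikea[ikea['category'].isin(top_categories)]

# One boxplot panel per value of sellable_online
g = sns.catplot(x='category', y='price', hue='category',
                col='sellable_online',
                data=new_data, kind='box',
                height=4, aspect=1.5)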

g.set_xticklabels(rotation=45, ha = 'right')

plt.show()

Notes on Faceting

  • In Pandas and in Matplotlib, faceting would require first creating subplots (one for each facet)

  • Then process the categories in each facet: in the Ikea example, we would filter the DataFrame for a category and then plot that category in its designated subplot.

  • If you were to avoid Seaborn, choose Matplotlib over Pandas

Just curious

I asked ChatGPT how to facet the Pandas boxplot. Here is its response. It assumes the Ikea data had a region column, which the dataset shown earlier does not have, so treat it as a sketch:

# Unique regions to facet by
regions = ikea['region'].unique()

# Create a figure with a subplot for each region
fig, axes = plt.subplots(nrows=1, ncols=len(regions), figsize=(5 * len(regions), 5), sharey=True)

# Loop over each region and create a boxplot in the corresponding subplot
for ax, region in zip(axes, regions):
    subset = ikea[ikea['region'] == region]
    subset.boxplot(column='price', by='category', ax=ax)
    ax.set_title(f'Region: {region}')
    ax.set_xlabel('Category')
    ax.set_ylabel('Price')
    ax.tick_params(axis='x', rotation=45)  # Optional: Rotate x-axis labels for clarity

# Adjust layout and display the plot
plt.tight_layout()
plt.suptitle('Price Distribution by Category and Region')  # Set the overall title
plt.subplots_adjust(top=0.85)  # Adjust the top margin to fit the suptitle
plt.show()

More faceting with Seaborn

Code: More faceting with Seaborn

# FacetGrid in Seaborn

g = sns.FacetGrid(data=ikea, col= 'other_colors', height=4, aspect=1.5)
g.map(plt.hist, "price", bins=20, color='b')

plt.show()

Plotnine

  • https://plotnine.org/

  • ggplot in Python

  • about 100 developers and about 200 contributors

  • still very new (about a year ago) with many issues (e.g., interacting with Matplotlib)

ggplot in Python with plotnine

Code: ggplot in Python with plotnine

(
    ggplot(mtcars, aes("wt", "mpg", color="factor(gear)"))
    + geom_point()
    + stat_smooth(method="lm")
    + facet_wrap("gear")
    + theme_tufte()
)